I am analyzing a trace in GTFS, a collection of CSV files. Each record of a csv file is consisting of one or more fields speparated by commas ,
. It is error-prone to read csv files by simply using [line.split(',') for line in f.readlines()]
for fields might contain commas. Therefore, I decide to switch to the Python module csv.
1. CSV format
CSV (Comma-separated values) files store tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas[1]. CSV format is the most common import and export format for spreadsheets and databases. Note that fields with embedded commas must be quoted (enclosed within double-quote characters) to avoid confusions. For instance,
stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type,parent_station
3666595,,"227, AVENUE DES NATIONS","AVENUE DE LA PYRAMIDE - 93073",48.98292708588317,2.5197046726137797,0,
The Python module csv is designed to read and write csv files, primarily including:
csv.reader(...)
andcsv.writer(...)
, read and write sequencescsv.DictReader(...)
andcsv.DictWriter(...)
, read and write data in dictionary form
2. Read and write sequences
2.1 Read a csv file into a list of lists
csv.reader(csvfile, dialect='excel', **fmtparams)
returns a reader object which will iterate over lines in the given `csvfile.
import csv
# Usage
csv.reader(csvfile, dialect='excel', **fmtparams) # Return a reader object which will iterate over lines in the given csvfile.
with open(file_preprocess, 'r') as f:
lists = [row for row in csv.reader(f, delimiter=',')] # read a csv file into a list of lists
2.2 Write a list of lists to a csv file
csv.writer(csvfile, dialect='excel', **fmtparams)
returns a writer object responsible for converting the user’s data into delimited strings on the given file-like object. csvfile
can be any object with a write()
method.
# Usage:
csv.writer(csvfile, dialect='excel', **fmtparams) # return a writer object
def write_lists_to_csv(filename, lists, fieldnames=None):
with open(out_file, 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
# add header
if fieldnames:
writer.writerow(fieldnames)
writer.writerows(lists)
Using newline=''
is to disable universal newlines translation on all platforms.
2.3 dialect and fmtparams
(1) dialect
The value of dialect
can be one of:
excel
, the usual properties of an Excel-generated CSV fileexcel-tab
, the usual properties of an Excel-generated TAB-delimited fileunix
, the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as line terminator and quoting all fields
(2) fmtparams
The parameter fmtparams
is described in Dialects and Formatting Parameters. To make it easier to specify the format of input and output records, specific formatting parameters are grouped together into dialects. Dialects support the following attributes:
delimiter=',' # A one-character string used to separate fields
doublequote=True # When True, a field within `quotechar` should be quoted; When False, the escapechar is used as a prefix to the quotechar
lineterminator='\r\n' # The string used to terminate lines produced by the writer. Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator.
quotechar='"' # A one-character string used to quote fields containing special characters
escapechar=QUOTE_NONE # Escaping is disable by default.
quoting='QUOTE_MINIMAL' # Controls when quotes should be generated by the writer and recognised by the reader.
skipinitialspace=False # When True, whitespace immediately following the delimiter is ignored.
strict=False # When True, raise exception Error on bad CSV input.
2. Read and write data in dictionary form
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) # Create an object that maps the information read into a dict
class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds) # Create an object that maps dictionaries onto output rows.
Appendix: file open mode
The available modes for open(...)
are:
r
open for reading (default)w
open for writing, truncating the file firstx
open for exclusive creation, failing if the file already existsa
open for writing, appending to the end of the file if it existsb
binary modet
text mode (default)+
open a disk file for updating (reading and writing)U
universal newlines mode (deprecated since3.4
)
References:
[1] Wikipedia: Comma-separated values